In this lecture, we cover the following topics:
from IPython.display import Image
Machine learning: a subfield of AI
ML refers to algorithms that can derive rules for predictions automatically from data
ML is now a central field of computer science and plays a growing role in everyday life.
Image(filename='./images/01_01.png', width=500)
Image(filename='./images/01_02.png', width=500)
Supervised learning: learn a model from labeled training data that allows us to make predictions about unseen (future) data points
The term supervised refers to the fact that the desired output labels of training samples are already known
Ex. spam email filtering
Two types:
Image(filename='./images/01_03.png', width=300)
Terminology
Goal: to predict the categorical class labels of new instances based on past observations
Types of classification
Binary classification: distinguish between two possible classes (e.g. spam and non-spam emails)
Multi-class classification: distinguish amongst multiple classes (e.g. handwritten digits from 0 to 9)
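To make the classification goal concrete, here is a minimal sketch of a binary classifier. It uses a nearest-centroid rule, chosen only for brevity; the lecture does not prescribe this method. The training samples, the feature meanings, and the helper names `fit_centroids` and `predict` are assumptions made for illustration.

```python
from math import dist

# Hypothetical labeled training data: (features, class label).
# The two features are illustrative only, e.g. counts of suspicious
# words and of links in an email.
train = [
    ((8.0, 5.0), "spam"),
    ((7.0, 6.0), "spam"),
    ((1.0, 0.0), "non-spam"),
    ((2.0, 1.0), "non-spam"),
]

def fit_centroids(samples):
    """Return {label: mean feature vector} computed from labeled samples."""
    grouped = {}
    for features, label in samples:
        grouped.setdefault(label, []).append(features)
    return {
        label: tuple(sum(column) / len(rows) for column in zip(*rows))
        for label, rows in grouped.items()
    }

def predict(centroids, features):
    """Predict the label whose centroid is closest (Euclidean distance)."""
    return min(centroids, key=lambda label: dist(centroids[label], features))

centroids = fit_centroids(train)
prediction = predict(centroids, (6.0, 4.0))  # a new, unseen instance
print(prediction)
```

The same code handles multi-class problems unchanged: each additional label simply contributes one more centroid.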
Image(filename='./images/01_04.png', width=300)
Goal: to predict a continuous outcome
Given:
We try to find a relationship between those variables that allows us to predict the outcome
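As a sketch of what "finding a relationship" can look like, the code below fits a straight line to one predictor variable by ordinary least squares. The data points are made up (they lie exactly on y = 2x + 1), and `fit_line` is a hypothetical helper name.

```python
# Hypothetical data: one predictor variable x and a continuous outcome y.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line(xs, ys)

# Use the fitted relationship to predict the outcome for a new x.
predicted = slope * 6.0 + intercept
print(slope, intercept, predicted)
```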
Image(filename='./images/01_05.png', width=300)
Goal: to develop a system (agent) that improves its performance based on interactions with the environment.
The environment gives feedback that includes a reward signal. This reward is not a ground-truth label or value; it is a measure of how good the action was, as scored by a reward function
Ex. a chess engine. The agent decides upon a series of moves depending on the state of the board (the environment), and the reward can be defined as win or lose at the end of the game.
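The agent-environment loop can be sketched in a few lines. This is a deliberately simplified toy, not a full reinforcement learning algorithm: the environment has no state, the rewards in `REWARDS` are fixed made-up numbers, and the agent tries every action once before repeating the best one it has seen.

```python
# Hypothetical environment: each action yields a fixed reward.
REWARDS = {"left": 0.2, "stay": 1.0, "right": 0.5}

def environment(action):
    """Return a reward signal for the action (a score, not a correct label)."""
    return REWARDS[action]

def run_agent(actions, n_steps):
    """Explore each action once, then exploit the best reward observed."""
    estimates = {}   # action -> last observed reward
    history = []
    total_reward = 0.0
    for _ in range(n_steps):
        untried = [a for a in actions if a not in estimates]
        if untried:
            action = untried[0]                         # explore
        else:
            action = max(estimates, key=estimates.get)  # exploit
        reward = environment(action)
        estimates[action] = reward
        history.append(action)
        total_reward += reward
    return history, total_reward

history, total_reward = run_agent(["left", "stay", "right"], n_steps=6)
print(history, total_reward)
```

The agent is never told which action is correct; it only observes rewards and shifts its behavior toward the action that scored best.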
Supervised learning: we know the right label beforehand when we train the model
Unsupervised learning: we deal with unlabeled data. We explore the structure of our data to extract meaningful information without the guidance of a known outcome variable or reward function.
Image(filename='./images/01_06.png', width=300)
Clustering: an exploratory data analysis technique that allows us to organize a pile of information into meaningful subgroups (clusters) without having any prior knowledge of their group memberships.
Ex. discover customer groups based on their interests, in order to develop distinct marketing programs.
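A minimal sketch of clustering, using k-means on one-dimensional data. The lecture does not name a specific algorithm, so k-means is an assumption here, as are the data values, the starting centers, and the helper name `kmeans_1d`.

```python
def kmeans_1d(values, centers, n_iter=10):
    """Alternate between assigning values to the nearest center and
    moving each center to the mean of its assigned values."""
    clusters = [[] for _ in centers]
    for _ in range(n_iter):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous center.
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

# Hypothetical "interest scores" for six customers; no labels are given.
scores = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centers, clusters = kmeans_1d(scores, centers=[0.0, 5.0])
print(centers, clusters)
```

No group memberships were supplied; the two subgroups emerge from the distances between the data points alone.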
Image(filename='./images/01_07.png', width=500)
Image(filename='./images/01_08.png', width=500)
The Iris dataset: 150 iris flowers from three different species (Setosa, Versicolor, and Virginica)
Matrix and vector notation: a feature (data) matrix X stores samples as rows and features as columns
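The row/column convention can be illustrated with a small NumPy array. The three rows below are illustrative values only, not a load of the real dataset; the full Iris matrix would have 150 rows and 4 columns (sepal length, sepal width, petal length, petal width).

```python
import numpy as np

# Illustrative values: 3 samples (rows) x 4 features (columns).
X = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [7.0, 3.2, 4.7, 1.4],
    [6.3, 3.3, 6.0, 2.5],
])
# One class label per sample (row).
y = np.array(["Setosa", "Versicolor", "Virginica"])

n_samples, n_features = X.shape
first_sample = X[0]       # one row: all features of the first flower
first_feature = X[:, 0]   # one column: the first feature across all flowers
print(n_samples, n_features, first_sample, first_feature)
```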
Image(filename='./images/01_09.png', width=700)
No free lunch theorem (David Wolpert, 1996)
An ML algorithm is designed to perform well on certain tasks, which requires certain assumptions
There is no universal ML algorithm that performs well on all tasks.
No single classification model enjoys superiority if we don't make any assumptions about the task.
Therefore, it is essential to compare a handful of different algorithms in order to train and select the best performing model.
Validation set: a subset of the training data, used for e.g. model selection (recall that we do not touch the test data till the end, for performance evaluation regarding future data points)
After we've selected a model that has been fitted on the training data, we can use the test dataset to estimate how well it may perform on unseen data, i.e. to estimate the generalization error
If we're satisfied with the performance, we can use this model to predict new, future data
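The workflow above can be sketched as two successive hold-out splits: first set the test data aside, then carve a validation set out of what remains. The split fractions, the seeds, and the helper name `hold_out` are assumptions for illustration, and integer IDs stand in for real samples.

```python
import random

def hold_out(samples, fraction, seed):
    """Shuffle a copy of samples and return (remaining, held_out)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_held = round(len(shuffled) * fraction)
    return shuffled[n_held:], shuffled[:n_held]

data = list(range(150))  # stand-in for 150 samples

# Step 1: set aside the test set; it is not touched until the final evaluation.
train_full, test = hold_out(data, fraction=0.2, seed=0)

# Step 2: take the validation set from the training data, for model selection.
train, val = hold_out(train_full, fraction=0.25, seed=1)

print(len(train), len(val), len(test))
```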
Important note: the parameters for feature scaling and dimensionality reduction must be estimated solely from the training dataset, and the same parameters are later applied to transform the test dataset
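A minimal sketch of this rule, using standardization as the scaling method (an assumption; other scalers follow the same pattern). The mean and standard deviation come from the training values only and are then reused, unchanged, on the test values. The numbers are made up.

```python
from statistics import mean, pstdev

# Hypothetical values of a single feature.
train_values = [2.0, 4.0, 6.0, 8.0]
test_values = [5.0, 10.0]

# Estimate the scaling parameters from the training data only.
mu = mean(train_values)
sigma = pstdev(train_values)

def standardize(values, mu, sigma):
    """Shift and scale values using the given (training) parameters."""
    return [(v - mu) / sigma for v in values]

train_scaled = standardize(train_values, mu, sigma)
# The test data is transformed with the same mu and sigma, never its own.
test_scaled = standardize(test_values, mu, sigma)
print(train_scaled, test_scaled)
```

Recomputing `mu` and `sigma` on the test set would leak information about the test data into preprocessing and put the two datasets on different scales.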
Python is one of the most popular languages for data science
Python itself is not very fast: it is an interpreted language
NumPy and SciPy libraries are built using Fortran and C
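The sketch below computes the same dot product twice: once with an explicit Python loop, and once with a single NumPy call that hands the loop to compiled code. Array size and values are arbitrary, and `dot_python` is a hypothetical helper. No timing is shown because it depends on the machine.

```python
import numpy as np

def dot_python(a, b):
    """Dot product with an explicit loop, run step by step by the interpreter."""
    total = 0.0
    for x, y in zip(a, b):
        total += x * y
    return total

a = [float(i) for i in range(1000)]
b = [2.0] * 1000

result_loop = dot_python(a, b)

# The same computation as one vectorized call on NumPy arrays.
result_numpy = float(np.dot(np.array(a), np.array(b)))
print(result_loop, result_numpy)
```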